04 October 2019

Background

Education




  • BSc (Mathematical Sciences)
  • MSc (Statistics)
  • PhD (Computational Statistics)
    • Prof. Nial Friel

Path in Industry - Postdoc Researcher


  • Insight Centre for Data Analytics (UCD)
    • Prof. Andrew Parnell
    • Supervised and Unsupervised Classification
    • Industry-Led Research

Path in Industry - Data Scientist


  • Startup (\(\approx 120\) employees)
  • Ecommerce Analytics
  • 2 Data Scientists
  • Acquired in 2017

Path in Industry - Senior Data Scientist

  • Part of Large Multinational (\(1500+\) employees)
  • \(500+\) employees in Edge Brand
  • Data Science
    • Department of 13 People
      • Data Scientists
      • Data Engineers
      • Data Analysts

  • Parent Company
    • Ascential PLC
    • Originally Publishing
      • East Midland Allied Press (EMAP)
    • Now a Global Information Company
      • Product Design
      • Marketing
      • Sales


  • Combination of 5 legacy businesses
    • Planet Retail (f.2001 - a.2007)
    • RetailNet Group (f.2006 - a.2015)
    • One Click Retail (f.2013 - a.2016)
    • Clavis Insight (f.2008 - a.2017)
    • Brandview (f.2008 - a.2018)

The Data, Insights and Advisory Solution you need to win in an ecommerce-driven world.

  • Ecommerce Analytics
    • Sales
    • Market Share
    • Content
      • In Stock
      • Ratings & Reviews
      • Product Info

Some Data Science Projects in Edge

  • Auto Classification
    • Classifying New Products
  • User Interfaces (Shiny Apps)
    • Cleaning Data
    • Transforming Back-end Files
    • Structured Access to Databases
  • Method of Matching Products (fuzzy text matching)
  • Ad-hoc Projects
    • Predicting Sales Based on Various Metrics
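
One simple way to do the fuzzy text matching mentioned above is edit distance, which base R provides through adist(). This is only a sketch of the general idea, not the method used in Edge; the product names are made up for illustration.

```r
# Hypothetical product names; none of these come from the original project
catalogue <- c("Cadbury Dairy Milk Chocolate Bar 200g",
               "Evian Natural Mineral Water 1.5L",
               "Wrigley's Extra Peppermint Gum")
scraped <- c("cadbury dairy milk choc bar 200 g",
             "evian mineral water 1.5l")

# Edit (Levenshtein) distance between every scraped name and every catalogue name
distances <- adist(tolower(scraped), tolower(catalogue))

# For each scraped name, keep the catalogue entry with the smallest distance
best <- apply(distances, 1, which.min)
matches <- data.frame(
  scraped  = scraped,
  matched  = catalogue[best],
  distance = distances[cbind(seq_along(best), best)],
  stringsAsFactors = FALSE
)
print(matches)
```

In practice a distance threshold would probably be needed so that products with no good match are left for manual review.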

Auto Classification

Problem

  • Classification of Products


  • Ecommerce sites are fast moving and no two are exactly alike.
    • Different stores have different category names and structures.
  • There is a need to provide a standardized category view across multiple sites.
    • Each manufacturer has a different view of categories.
    • Need to be flexible when assigning a product to a category.

Classification of Products

  • Amazon
    • Grocery & Gourmet Food > Candy & Chocolate > Bars
  • Walmart
    • Food > Candy & Gum > Chocolate
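
A standardized category view can be thought of as a lookup from each retailer's own category path to one shared category. The sketch below uses the two paths from this slide; the standard name "Chocolate Bars" and the product listings are assumptions made for illustration.

```r
# Lookup table from retailer-specific categories to one standard category.
# "Chocolate Bars" is a hypothetical standard name chosen for illustration.
category_map <- data.frame(
  retailer = c("Amazon", "Walmart"),
  retailer_category = c("Grocery & Gourmet Food > Candy & Chocolate > Bars",
                        "Food > Candy & Gum > Chocolate"),
  standard_category = c("Chocolate Bars", "Chocolate Bars"),
  stringsAsFactors = FALSE
)

# Two hypothetical listings of the same kind of product
listings <- data.frame(
  product = c("Milk chocolate bar 100g", "Milk chocolate bar 100g"),
  retailer = c("Amazon", "Walmart"),
  retailer_category = category_map$retailer_category,
  stringsAsFactors = FALSE
)

# Attach the standard category to each listing
standardised <- merge(listings, category_map,
                      by = c("retailer", "retailer_category"), all.x = TRUE)
print(standardised[, c("retailer", "product", "standard_category")])
```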

Classification of Products


  • When a new product is found on a retailer, the information is collected.
  • Each new product is then classified into its correct category.
  • Done manually, this task is very time consuming and error prone, and it requires a lot of intervention by users.
    • 5 hours per instance (2016)
    • 350+ instances

Roadmap for Automating Classification


  • Research
    • Find suitable algorithm.
  • Beta testing
    • Test the algorithm within the current process.
  • Integrating code
    • Build the code into the current company platform.

Supervised Classification




  • Multinomial Logistic Regression (Maxent)
    • Reasonably Accurate
    • Quick
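
A minimal sketch of multinomial logistic regression on product names is shown below. It uses nnet::multinom() as a stand-in and is not the Maxent implementation used in the project; the training data, vocabulary and the helper featurise() are hypothetical.

```r
library(nnet)  # Recommended package shipped with R; provides multinom()

# Toy training data (hypothetical product names and categories)
train <- data.frame(
  name = c("milk chocolate bar", "dark chocolate bar", "chocolate truffles",
           "sparkling water", "still water bottle", "mineral water",
           "mint chewing gum", "bubble gum", "sugar free gum"),
  category = rep(c("Chocolate", "Water", "Gum"), each = 3),
  stringsAsFactors = FALSE
)

# Bag-of-words features: one 0/1 column per vocabulary word
vocab <- c("chocolate", "water", "gum")
featurise <- function(product_names) {
  feats <- sapply(vocab, function(w) as.integer(grepl(w, product_names, fixed = TRUE)))
  as.data.frame(matrix(feats, ncol = length(vocab), dimnames = list(NULL, vocab)))
}

# Fit the multinomial logistic regression on the word indicators
train_x <- cbind(featurise(train$name), category = factor(train$category))
fit <- multinom(category ~ ., data = train_x, trace = FALSE)

# Classify two new (hypothetical) products
new_products <- c("white chocolate bar", "spring water 2l")
predicted <- as.character(predict(fit, newdata = featurise(new_products)))
print(data.frame(new_products, predicted))
```

A real model would use a much larger vocabulary built from the product text, but the fit and predict steps would look similar.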

Shiny App

Plumber API


  • The plumber library exposes R code so that it can be called through an API.
    • Most programming languages can call APIs.
    • Arguments in the API call point the R code to the database containing the new data.
    • The R code classifies the new data and updates the database with the new categories.
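
The flow above could be sketched as a plumber file along the following lines. Everything here is an assumption made for illustration: the endpoint path /classify, the argument name, the in-memory environment standing in for the database, and the keyword rule standing in for the trained classifier.

```r
# Hypothetical plumber file: names and the in-memory "database" are illustrative only

# Stand-in for a database table of newly found products (assumption)
product_db <- new.env()
product_db$new_products <- data.frame(
  id = 1:2,
  name = c("Milk chocolate bar", "Sparkling water 500ml"),
  category = NA_character_,
  stringsAsFactors = FALSE
)

# Placeholder classifier: a keyword rule standing in for the trained model
classify_names <- function(product_names) {
  ifelse(grepl("chocolate", product_names, ignore.case = TRUE),
         "Chocolate", "Unclassified")
}

# Read the table, classify each product and write the categories back
classify_table <- function(table) {
  products <- get(table, envir = product_db)
  products$category <- classify_names(products$name)
  assign(table, products, envir = product_db)
  list(table = table, classified = nrow(products))
}

#* Classify the products in a table and update it with the new categories
#* @param table Name of the table holding the new products
#* @post /classify
function(table = "new_products") {
  classify_table(table)
}
```

Saved as plumber.R, a file like this could be served with plumber::plumb("plumber.R")$run(port = 8000), after which any language that can send an HTTP POST request can trigger the classification.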

More Info (Poster from CASI 2018)

‘Data Science’

Data Science

  • What does it mean?

  • Disciplines
    • Statistics
    • Computer Science
  • Skill sets
    • Analytics/Statistics
    • ETL (Extract Transform Load)
    • Automation (ML)

Data Science





Working with data!

Job Titles



  • Statistician
  • Data Analyst
  • Data Scientist
  • Data Engineer

Different Disciplines


  • Statistics
    • Statistical Algorithms “Machine Learning”
    • Descriptive Statistics
    • Visualisation
  • Computer Science
    • Implementing Statistical Algorithms
    • Efficient Management of Data (ETL)

Data Scientist

(Formerly a catch-all role)



  • A data scientist has an excellent understanding of statistical methods
  • Is able to test and implement methods but does not specialise in this
(Statistician)

Data Engineer




  • A data engineer specialises in implementing techniques
  • Understands the basics of the methods but is not an expert

Demand for Statisticians

Statisticians





  • Is statistics still recognised as being important?

LinkedIn Job Adverts


  • Experiment
    • 24th of September 2019
    • 4 search terms used: ‘Statistician’, ‘Data Scientist’, ‘Statistics’, ‘Data Science’.

Load Webpage

library(RSelenium)
rd <- rsDriver(browser = "firefox", port = 4444L)  # Download binaries, start driver
my_session <- rd$client # Create client object
my_session$open()  # Open session

search_terms <- c("Data%20Scientist", "Data%20Science", "Statistics", "Statistician")
term <- 1  # Index of the search term to scrape
my_session$navigate(  # Navigate to the page
  paste0("https://ie.linkedin.com/jobs/search?keywords=", search_terms[term],
         "&location=Dublin%2C%20Ireland&trk=guest_job_search_jobs-search-bar_search-submit",
         "&redirect=false&position=1&pageNum=0"))  # URL split across strings so it contains no line break
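
The %20 and %2C codes in the URL are percent-encodings of the space and the comma. As a small sketch, the same encodings can be produced with utils::URLencode() rather than typed by hand; only the keywords and location parameters are kept here, which is a simplification of the original URL.

```r
# Percent-encode the search terms and location instead of typing %20 and %2C by hand
search_terms <- c("Data Scientist", "Data Science", "Statistics", "Statistician")
encoded_terms <- vapply(search_terms, utils::URLencode, character(1),
                        reserved = TRUE, USE.NAMES = FALSE)
encoded_location <- utils::URLencode("Dublin, Ireland", reserved = TRUE)

# Build one search URL per term (simplified: tracking parameters are omitted)
search_urls <- paste0("https://ie.linkedin.com/jobs/search?keywords=", encoded_terms,
                      "&location=", encoded_location)
print(search_urls)
```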

Reveal All Jobs

for (i in 1:20) {  # Loop and click "Load more jobs" button
  btn_available <-   # Check if button still exists 
    tryCatch({
      load_btn <- 
        my_session$findElement(using = "css selector", ".see-more-jobs")
      TRUE
      },error = function(e) FALSE)
  if(!btn_available) break  # End loop if no button
  load_btn$clickElement()  # Click button
  Sys.sleep(runif(1, 3, 5))  # Random wait between 3 and 5 seconds
}

Save Data




library(magrittr)  # Provides the %>% pipe
my_session$getPageSource()[[1]] %>% # Get HTML and save Data
  writeLines(paste0("data/", format(Sys.time(), "%Y_%m_%d"), "_LI_", 
                    gsub("%20", "_", search_terms[term]), "_Dublin.txt"))  
my_session$close() # Close session
rd[["server"]]$stop()  # stop driver

Parse Data

lapply(c("Statistician", "Data_Scientist", "Statistics", "Data_Science"),
       function(jobtitle){
         paste0("data/2020_03_01_LI_", jobtitle, "_Dublin.txt") %>%  # Filename
           xml2::read_html() %>%  # Read in data as HTML
           lapply(X = 1:700, FUN = function(job_i, raw_html = .){  # Parse Initial HTML
             raw_html %>% rvest::html_nodes(xpath = paste0('/html/body/main/div/section/ul/li[', job_i,']'))
           }) %>%
           lapply(function(main_html){  # Parse sub elements of HTML
             c(main_html %>% rvest::html_nodes(xpath = 'a') %>%  # Title
                 rvest::html_text() %>% ifelse(test = length(.) > 0, ., NA),
               main_html %>%  rvest::html_nodes(xpath = 'div[1]/h4/a') %>%  # Company
                 rvest::html_text() %>% ifelse(test = length(.) > 0, ., NA),
               main_html %>% rvest::html_nodes(xpath = 'div[1]/div') %>%  # Description
                 rvest::html_text( ) %>% ifelse(test = length(.) > 0, ., NA),
               jobtitle)
             }) %>% 
           do.call(what = rbind, .)  # Combine data for each term
       }) %>%
  do.call(rbind, .) %>%  # Combine the 4 data sets
  as.data.frame(stringsAsFactors = FALSE) %>%  # Create data frame
  dplyr::select(Title = V1, Company = V2, Text = V3, SearchTerm = V4) %>%  # Rename variables
  dplyr::filter(!is.na(Title)) -> job_data  # Remove NA's

LinkedIn Job Adverts

  • 01 March 2020 (first run on 24th of September 2019)
  • 4 search terms used: ‘Statistician’, ‘Data Scientist’, ‘Statistics’, ‘Data Science’.
  • Total of 498 unique jobs returned.
Search Term      Results
Statistician          11
Data_Science          25
Data_Scientist       260
Statistics           444
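
Counts like these can be computed from a data frame shaped like job_data. The sketch below uses a made-up stand-in with the same column names, and it assumes that an advert counts as unique when its Title and Company pair is unique; the original definition of "unique" is not stated.

```r
# Toy stand-in for job_data with the same columns (all values are made up)
job_data_example <- data.frame(
  Title = c("Data Scientist", "Data Scientist", "Statistician", "Data Analyst"),
  Company = c("Company A", "Company A", "Company B", "Company C"),
  SearchTerm = c("Data_Scientist", "Data_Science", "Statistician", "Statistics"),
  stringsAsFactors = FALSE
)

# Results per search term, sorted from fewest to most
results_per_term <- sort(table(job_data_example$SearchTerm))
print(results_per_term)

# Assumption: an advert is "unique" if its Title/Company pair is unique
n_unique_jobs <- nrow(unique(job_data_example[, c("Title", "Company")]))
print(n_unique_jobs)
```

Because the same advert can be returned by more than one search term, the per-term counts can add up to more than the number of unique jobs.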

Job Titles

Word Clouds (Job Descriptions)

Statistician

Data Scientist

Statistics

Data Science

Thank you

Tips/Tricks Learned While Making This Presentation

R Markdown Graphs

```{r}
library(ggplot2)
ggplot(mpg, aes(displ, cty, colour = class)) +
  geom_point()
```

R Markdown Graphs

```{r, fig.retina = 4, dev.args = list(bg = 'transparent')}
library(ggplot2)
ggplot(mpg, aes(displ, cty, colour = class)) +
  geom_point() +
  theme(plot.background = element_rect(fill = "transparent", color = NA))
```